Abstract

We develop a general technique for bounding the tail of the total variation distance 
between the empirical and the true distributions over countable sets. Our methods sharpen 
a deviation bound of Devroye (1983) for distributions over finite sets, and also hold for 
the broader class of distributions with countable support. We also provide some lower 
bounds of possible independent interest. 

1 Introduction 

Establishing conditions and rates for the convergence of empirical frequencies to their expected 
values is a central problem in statistics. For concreteness, let $X$ be an $\mathbb{N}$-valued random
variable distributed according to $p = (p_1, p_2, \ldots)$ and let $X_1, X_2, \ldots, X_n$ be $n$ independent
copies of $X$. The canonical estimator for $p_j$ is obtained via the maximum likelihood principle,
which just amounts to a normalized frequency:

\[ \hat p_j^{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = j\}. \]

The weak law of large numbers guarantees that $\hat p_j^{(n)} \to p_j$ in probability as $n \to \infty$,
for all $j \in \mathbb{N}$.

The Chernoff-Hoeffding bound $\mathbb{P}(|\hat p_j^{(n)} - p_j| > \varepsilon) \le 2\exp(-2n\varepsilon^2)$, together with the
Borel-Cantelli lemma, strengthens the convergence to be almost sure, thus establishing a strong
law of large numbers. A uniform strong law of large numbers is provided by the
Dvoretzky-Kiefer-Wolfowitz inequality [6, 10]

\[ \mathbb{P}\Big( \sup_{i \in \mathbb{N}} |\hat F_n(i) - F(i)| > \varepsilon \Big) \le 2\exp(-2n\varepsilon^2), \qquad \varepsilon > 0,\ n \in \mathbb{N}, \]
where $\hat F_n(i) = \sum_{j \le i} \hat p_j^{(n)}$ and $F(i) = \sum_{j \le i} p_j$. Indeed, since
$\hat p_j^{(n)} = \hat F_n(j) - \hat F_n(j-1)$ and $p_j = F(j) - F(j-1)$, we have

\[ |\hat p_j^{(n)} - p_j| = |(\hat F_n(j) - \hat F_n(j-1)) - (F(j) - F(j-1))|
   \le |\hat F_n(j) - F(j)| + |\hat F_n(j-1) - F(j-1)| \]

and therefore

\[ \sup_{j \in \mathbb{N}} |\hat p_j^{(n)} - p_j| \le 2 \sup_{i \in \mathbb{N}} |\hat F_n(i) - F(i)|. \]

We conclude that

\[ \mathbb{P}\big( \|\hat p^{(n)} - p\|_\infty > \varepsilon \big) \le 4\exp(-n\varepsilon^2/2), \qquad \varepsilon > 0, \]

so that $\|\hat p^{(n)} - p\|_\infty \to 0$ as $n \to \infty$ almost surely (again, Borel-Cantelli is invoked).

An even stronger observation is that $\|\hat p^{(n)} - p\|_1 \to 0$ as $n \to \infty$ almost surely. The $\ell_1$ distance is
in some sense the most natural one over distributions [7], since by Scheffé's identity [5],

\[ 2 \sup_{E \subseteq \mathbb{N}} |p(E) - q(E)| = \|p - q\|_1, \]

for any two distributions $p, q$ over $\mathbb{N}$ (for this reason, $\ell_1$ is also referred to as the total variation
distance). Almost-sure convergence in $\ell_1$ may be surmised from Sanov's theorem [2, 3],
whose drawback, however, is that it does not readily yield explicit, analytically tractable
estimates for $\mathbb{P}(\|\hat p^{(n)} - p\|_1 > \varepsilon)$.

Actually, Sanov's theorem guarantees that $\hat p^{(n)} \to p$ in yet a stronger sense, which may
be called complete convergence in $\ell_1$. Complete convergence was introduced in [8]. Applied
to the random variable

\[ J_n = \|\hat p^{(n)} - p\|_1, \]

it means that

\[ \sum_{n=1}^{\infty} \mathbb{P}(J_n > \varepsilon) < \infty \]

for all $\varepsilon > 0$. For $p \in \mathbb{R}^k$ (that is, distributions with support of size $k$), one may combine the
Chernoff-Hoeffding and the union bounds to obtain the following rough estimate: 

\[ \mathbb{P}(J_n > \varepsilon) \le 2k \exp(-2n\varepsilon^2/k^2). \tag{2} \]

Though crude, (2) suffices to establish the complete convergence in $\ell_1$ of $\hat p^{(n)}$ to $p$ for
distributions with finite support. A significant improvement is given by [4, Lemma 3], which may
be stated as follows:

Lemma 1 (Devroye). For $p \in \mathbb{R}^k$, we have

\[ \mathbb{P}(J_n > \varepsilon) \le 3\exp(-n\varepsilon^2/25), \qquad \varepsilon \ge \sqrt{20k/n}. \]

However, for $p \in \mathbb{R}^{\mathbb{N}}$ with infinite support, neither (2) nor Lemma 1 is applicable. Our goal
in this paper is to establish analogues of Lemma 1 for distributions with countable support.
As a by-product, we improve Devroye's Lemma, sharpening the constant in the exponent by
an order of magnitude.
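
To get a feel for how the two finite-support estimates compare, the following Python sketch evaluates the right-hand sides of (2) and of Lemma 1. The function names are illustrative and not taken from the paper; since Lemma 1 is only stated for $\varepsilon \ge \sqrt{20k/n}$, the helper returns `None` outside that range rather than extrapolating.

```python
import math


def union_bound(n: int, k: int, eps: float) -> float:
    """Right-hand side of (2): 2k * exp(-2 n eps^2 / k^2)."""
    return 2 * k * math.exp(-2 * n * eps**2 / k**2)


def devroye_bound(n: int, k: int, eps: float):
    """Right-hand side of Lemma 1, or None when eps < sqrt(20 k / n)."""
    if eps < math.sqrt(20 * k / n):
        return None  # Lemma 1 makes no claim in this range
    return 3 * math.exp(-n * eps**2 / 25)
```

For instance, with $n = 10000$, $k = 10$ and $\varepsilon = 0.2$, the exponent in (2) is $-8$ while that in Lemma 1 is $-16$, so the latter is considerably smaller.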
2 Main results

Our basic work-horse is McDiarmid's inequality [11], which implies that whenever $X_i$,
$i = 1, \ldots, n$, are independent $\mathbb{N}$-valued random variables and $h : \mathbb{N}^n \to \mathbb{R}$ is 1-Lipschitz with respect
to the Hamming metric$^1$, we have

\[ \mathbb{P}\big( h(X_1, \ldots, X_n) > \mathbb{E} h(X_1, \ldots, X_n) + n\varepsilon \big) \le \exp(-2n\varepsilon^2), \qquad n \in \mathbb{N},\ \varepsilon > 0. \tag{3} \]

We choose $h$ to be the function mapping a sample $(X_1, \ldots, X_n)$ to the $\ell_1$ deviation of the
empirical frequencies from their expected values:

\[ h(X_1, \ldots, X_n) = \sum_{j \in \mathbb{N}} \Big| n p_j - \sum_{i=1}^{n} \mathbf{1}\{X_i = j\} \Big|. \]

In the notation above, $h(X_1, \ldots, X_n) = nJ_n$. Since $h$ is 2-Lipschitz under the Hamming
metric (Lemma 7), applying (3) to $h/2$ yields

\[ \mathbb{P}(J_n > \mathbb{E}J_n + \varepsilon) \le \exp(-n\varepsilon^2/2), \qquad n \in \mathbb{N},\ \varepsilon > 0. \tag{4} \]

(In fact, this estimate is near-optimal, as follows from an argument in the spirit of [4, Theorem 1].)
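
The random variable $J_n$ is easy to simulate, which gives a quick sanity check on the size of $\mathbb{E}J_n$. The sketch below is a minimal Monte Carlo illustration, assuming a distribution with finite support indexed from 0 (an implementation convenience, not the paper's indexing); the names `l1_deviation` and `simulate_jn` are hypothetical.

```python
import random
from collections import Counter


def l1_deviation(sample, p):
    """J_n = sum_j |phat_j - p_j| for p supported on {0, ..., len(p) - 1}."""
    n = len(sample)
    counts = Counter(sample)
    return sum(abs(counts.get(j, 0) / n - pj) for j, pj in enumerate(p))


def simulate_jn(p, n, trials, seed=0):
    """Draw `trials` independent samples of size n and return their J_n values."""
    rng = random.Random(seed)
    support = range(len(p))
    return [
        l1_deviation(rng.choices(support, weights=p, k=n), p)
        for _ in range(trials)
    ]
```

Averaging the returned values estimates $\mathbb{E}J_n$; the spread of the values around that average is what (4) controls.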

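Hence, the crux of the matter is to bound $\mathbb{E}J_n$. For $p \in \mathbb{R}^k$, it turns out that $\mathbb{E}J_n \le \sqrt{k/n}$,
which implies our first result:

Theorem 2. For every $k \in \mathbb{N}$, distribution $p \in \mathbb{R}^k$, and sample size $n$,

\[ \mathbb{P}(J_n > \varepsilon) \le \exp\Big( -\frac{n}{2} \Big( \varepsilon - \sqrt{k/n} \Big)^2 \Big), \qquad \varepsilon > \sqrt{k/n}. \]

Observe that for $\varepsilon \ge \sqrt{20k/n}$, Theorem 2 yields $\mathbb{P}(J_n > \varepsilon) \le \exp(-0.3n\varepsilon^2)$, thus improving
Lemma 1.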
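
The constant 0.3 comes from $(1 - 1/\sqrt{20})^2/2 \approx 0.301$. As a numerical illustration, the Python sketch below evaluates the bound of Theorem 2 next to the simplified form $\exp(-0.3n\varepsilon^2)$. The function names are illustrative, and returning the trivial bound 1 outside the theorem's range is a convention chosen here, not something stated in the paper.

```python
import math


def theorem2_bound(n: int, k: int, eps: float) -> float:
    """exp(-(n/2) (eps - sqrt(k/n))^2), stated for eps > sqrt(k/n)."""
    shift = math.sqrt(k / n)
    if eps <= shift:
        return 1.0  # trivial bound outside the theorem's range (our convention)
    return math.exp(-0.5 * n * (eps - shift) ** 2)


def simplified_bound(n: int, eps: float) -> float:
    """exp(-0.3 n eps^2), the form quoted for eps >= sqrt(20 k / n)."""
    return math.exp(-0.3 * n * eps**2)
```

With $n = 1000$, $k = 5$ and $\varepsilon = 0.5$, the exponent in Theorem 2 is roughly $-92$, against $-75$ for the simplified form and $-10$ for Lemma 1.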

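Our technique works just as well for $p \in \mathbb{R}^{\mathbb{N}}$ with infinite support. Indeed, as we show in
Lemma 8,

\[ \sqrt{n}\, \mathbb{E}J_n \le \sum_{j \in \mathbb{N}} \sqrt{p_j} =: \nu(p), \qquad n \in \mathbb{N}. \tag{5} \]

When the right-hand side of (5) is finite (as is the case for "most" common distributions), the
following result provides a simple and informative bound:

Theorem 3. When $\nu(p)$ is finite,

\[ \mathbb{P}\big( J_n > n^{-1/2} \nu(p) + \varepsilon \big) \le \exp(-n\varepsilon^2/2), \qquad n \in \mathbb{N},\ \varepsilon > 0. \]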
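
As a worked example of Theorem 3, consider the geometric distribution $p_j = 2^{-j}$, for which $\nu(p) = \sum_{j \ge 1} 2^{-j/2} = 1 + \sqrt{2}$. The Python sketch below computes $\nu(p)$ on a truncated support and inverts the bound to obtain a deviation threshold at confidence level $\delta$. The truncation at 200 terms, the helper names and the parameter `delta` are assumptions made for illustration.

```python
import math


def nu(p):
    """nu(p) = sum_j sqrt(p_j) for a finite (or truncated) probability vector."""
    return sum(math.sqrt(pj) for pj in p)


def theorem3_threshold(p, n, delta):
    """A value t for which Theorem 3 gives P(J_n > t) <= delta."""
    eps = math.sqrt(2 * math.log(1 / delta) / n)  # solves exp(-n eps^2 / 2) = delta
    return nu(p) / math.sqrt(n) + eps


# Geometric example p_j = 2^{-j}, truncated at 200 terms (illustrative choice).
geometric = [2.0 ** -j for j in range(1, 201)]
```

For this example with $n = 10000$ and $\delta = 0.05$, the threshold is a little under 0.05, split almost evenly between the mean term $\nu(p)/\sqrt{n}$ and the fluctuation term $\varepsilon$.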



$^1$ The Hamming metric is defined by $d(x, y) = \sum_{i=1}^{n} \mathbf{1}\{x_i \ne y_i\}$ for $x, y \in \mathbb{N}^n$.
When $\nu(p)$ is infinite, we can still extract meaningful bounds, albeit with a bit more effort.
As we show in Lemma 9,

\[ \mathbb{E}J_n \le \alpha_n(p) + \beta_n(p), \tag{6} \]

where

\[ \alpha_n(p) = 2 \sum_{p_j < 1/n} p_j, \qquad \beta_n(p) = \frac{1}{\sqrt{n}} \sum_{p_j \ge 1/n} \sqrt{p_j}. \tag{7} \]

At its most general, our result has the following form:

Theorem 4. For all distributions $p \in \mathbb{R}^{\mathbb{N}}$,

(i) $\mathbb{P}(J_n > \alpha_n + \beta_n + \varepsilon) \le \exp(-n\varepsilon^2/2)$, $n \in \mathbb{N}$, $\varepsilon > 0$.

(ii) $\alpha_n + \beta_n \to 0$ as $n \to \infty$.

(iii) The rate of decay in (ii) may be arbitrarily slow.

The bound in Theorem 4(i) may be rendered effective by our control over $\alpha_n$ and $\beta_n$ for
specific distribution families. Moreover, our estimate in (6) for $\mathbb{E}J_n$ in terms of $\alpha_n$ and $\beta_n$ is
nearly tight, in the following sense:

Proposition 5. For all $n \ge 2$ and all distributions $p \in \mathbb{R}^{\mathbb{N}}$,

\[ \mathbb{E}J_n \ge \frac{\alpha_n + \beta_n}{4} - \frac{1}{4\sqrt{n}}. \]

Remark. To keep the expressions simple, we have chosen $1/n$ as the break-point in defining
$\alpha_n$ and $\beta_n$. We note in passing that a minor improvement in the constants is achieved by the
(optimal) break-point $1/(4n)$.
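
For a distribution with finite support, both sides of this sandwich can be computed exactly, since $n\mathbb{E}J_n$ is a sum of binomial mean absolute deviations. The Python sketch below does so by direct summation over the binomial pmf; it is a brute-force illustration with hypothetical helper names, suitable only for small $n$.

```python
import math


def alpha_beta(p, n):
    """Return (alpha_n, beta_n) as in (7), with break-point 1/n."""
    alpha = 2 * sum(pj for pj in p if pj < 1 / n)
    beta = sum(math.sqrt(pj) for pj in p if pj >= 1 / n) / math.sqrt(n)
    return alpha, beta


def binomial_mad(n, q):
    """E|Y - n q| for Y ~ Bin(n, q), by direct summation over the pmf."""
    return sum(
        abs(y - n * q) * math.comb(n, y) * q**y * (1 - q) ** (n - y)
        for y in range(n + 1)
    )


def expected_l1_deviation(p, n):
    """E J_n = (1/n) * sum_j E|Y_j - n p_j| with Y_j ~ Bin(n, p_j)."""
    return sum(binomial_mad(n, pj) for pj in p) / n
```

For $p = (1/2, 1/4, 1/8, 1/16, 1/32, 1/32)$ and $n = 20$, the exact value of $\mathbb{E}J_n$ falls between the lower bound of Proposition 5 and the upper bound $\alpha_n + \beta_n$.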

The lower bound on $\mathbb{E}J_n$ follows directly from the lemma below, in which the first
inequality may be of independent interest:

Lemma 6. If $Y \sim \mathrm{Bin}(n, p)$, then

\[ \sqrt{np(1-p)/2} \le \mathbb{E}|Y - np| \le \sqrt{np(1-p)}, \qquad n \ge 2,\ p \in [1/n, 1 - 1/n]. \]
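
Lemma 6 can be checked numerically on a grid of parameters. The following Python sketch computes the mean absolute deviation by brute force and tests both inequalities; the small tolerance `tol` is an assumption added to absorb floating-point error, since equality holds in the lower bound for $n = 2$, $p = 1/2$.

```python
import math


def binomial_mad(n, p):
    """E|Y - n p| for Y ~ Bin(n, p), by direct summation over the pmf."""
    return sum(
        abs(y - n * p) * math.comb(n, y) * p**y * (1 - p) ** (n - y)
        for y in range(n + 1)
    )


def lemma6_holds(n, p, tol=1e-12):
    """Check sqrt(np(1-p)/2) <= E|Y - np| <= sqrt(np(1-p)) up to tol."""
    sd = math.sqrt(n * p * (1 - p))
    mad = binomial_mad(n, p)
    return sd / math.sqrt(2) - tol <= mad <= sd + tol
```

A grid check is of course no substitute for the proof given in Section 3; it only guards against misreading the statement.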

3 Proofs 

We state the following elementary fact without proof: 

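Lemma 7. Suppose $n \in \mathbb{N}$ and $p \in \mathbb{R}^{\mathbb{N}}$ is a distribution. Define $h : \mathbb{N}^n \to \mathbb{R}$ by

\[ h(x) = \sum_{j \in \mathbb{N}} \Big| n p_j - \sum_{i=1}^{n} \mathbf{1}\{x_i = j\} \Big|, \qquad x \in \mathbb{N}^n. \]

Then $h$ is 2-Lipschitz with respect to the Hamming metric.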
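
The reason is that changing one coordinate of $x$ moves exactly two of the counts by one each. The Python sketch below probes this empirically by perturbing a single coordinate of random inputs; the finite support indexed from 0 and the function names are assumptions made for the illustration.

```python
import random
from collections import Counter


def h(x, p):
    """h(x) = sum_j |n p_j - #{i : x_i = j}| for p supported on {0, ..., len(p) - 1}."""
    n = len(x)
    counts = Counter(x)
    return sum(abs(n * pj - counts.get(j, 0)) for j, pj in enumerate(p))


def max_change_single_coordinate(p, n, trials=1000, seed=0):
    """Largest |h(x) - h(x')| seen over random x, x' differing in at most one coordinate."""
    rng = random.Random(seed)
    support = range(len(p))
    worst = 0.0
    for _ in range(trials):
        x = [rng.choice(support) for _ in range(n)]
        y = list(x)
        y[rng.randrange(n)] = rng.choice(support)  # perturb one coordinate
        worst = max(worst, abs(h(x, p) - h(y, p)))
    return worst
```

The returned value should never exceed 2, in line with the lemma.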
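Lemma 8. Suppose $n \in \mathbb{N}$ and $p \in \mathbb{R}^{\mathbb{N}}$ is a distribution. Then

\[ \sqrt{n}\, \mathbb{E}J_n \le \sum_{j \in \mathbb{N}} \sqrt{p_j}. \]

Proof. Let $Y_j \sim \mathrm{Bin}(n, p_j)$. Then

\[ (\mathbb{E}|Y_j - np_j|)^2 \le \mathbb{E}(Y_j - np_j)^2 = np_j(1 - p_j) \le np_j, \]

whence

\[ \mathbb{E}|Y_j - np_j| \le \sqrt{np_j(1 - p_j)} \le \sqrt{np_j}. \tag{8} \]

Since

\[ n\mathbb{E}J_n = \sum_{j \in \mathbb{N}} \mathbb{E}|Y_j - np_j|, \tag{9} \]

the claim follows. □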
Lemma 9. Suppose $n \in \mathbb{N}$ and $p \in \mathbb{R}^{\mathbb{N}}$ is a distribution. Then

\[ \mathbb{E}J_n \le \alpha_n + \beta_n. \]

Proof. As in the proof of Lemma 8, let $Y_j \sim \mathrm{Bin}(n, p_j)$ and use (9) to obtain

\[ n\mathbb{E}J_n = \sum_{p_j < 1/n} \mathbb{E}|Y_j - np_j| + \sum_{p_j \ge 1/n} \mathbb{E}|Y_j - np_j|. \tag{10} \]

By (8), the second term on the right-hand side of (10) is clearly upper-bounded by $n\beta_n(p)$.
To bound the first term, we appeal to the mean absolute deviation formula for the binomial
distribution [9],

\[ \mathbb{E}|Y_j - np_j| = 2 (1 - p_j)^{n - \lfloor np_j \rfloor} p_j^{\lfloor np_j \rfloor + 1} (\lfloor np_j \rfloor + 1) \binom{n}{\lfloor np_j \rfloor + 1}, \tag{11} \]

which simplifies to

\[ \mathbb{E}|Y_j - np_j| = 2n(1 - p_j)^n p_j \le 2np_j, \qquad p_j < 1/n. \tag{12} \]

This shows that the first term on the right-hand side of (10) is upper-bounded by $n\alpha_n(p)$ and
proves the claim. □
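
The closed form (11) can be compared against a direct summation over the binomial pmf. The Python sketch below implements both; the function names are illustrative, and the comparison is only a floating-point sanity check.

```python
import math


def mad_direct(n, p):
    """E|Y - n p| for Y ~ Bin(n, p), by direct summation over the pmf."""
    return sum(
        abs(y - n * p) * math.comb(n, y) * p**y * (1 - p) ** (n - y)
        for y in range(n + 1)
    )


def mad_closed_form(n, p):
    """Formula (11): 2 (1-p)^(n-m) p^(m+1) (m+1) C(n, m+1), with m = floor(n p)."""
    m = math.floor(n * p)
    return 2 * (1 - p) ** (n - m) * p ** (m + 1) * (m + 1) * math.comb(n, m + 1)
```

When $p < 1/n$ we have $m = 0$, and the closed form collapses to $2np(1-p)^n$, which is the simplification used in (12).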



Proof of Theorem 2. We claim that

\[ \mathbb{E}J_n \le \sqrt{k/n}. \tag{13} \]

Indeed, by Lemma 8,

\[ \sqrt{n}\, \mathbb{E}J_n \le \sum_{j=1}^{k} \sqrt{p_j}. \tag{14} \]

Define $x \in \mathbb{R}^k$ by $x_j = \sqrt{p_j}$ and recall that

\[ \sum_{j=1}^{k} x_j = \|x\|_1 \le \sqrt{k}\, \|x\|_2 = \sqrt{k}, \tag{15} \]

which yields (13). In view of (4), this implies the theorem. □

Proof of Theorem 3. Immediate from (4) and Lemma 8. □

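Lemma 10. Let $p \in \mathbb{R}^{\mathbb{N}}$ be a distribution. Then

\[ \alpha_n(p) + \beta_n(p) \to 0 \quad \text{as } n \to \infty. \]

Proof. The decay of $\alpha_n(p)$ to zero is obvious, since it is the tail of a convergent series. To
prove that

\[ \lim_{n \to \infty} \frac{1}{\sqrt{n}} \sum_{p_j \ge 1/n} \sqrt{p_j} = 0, \tag{16} \]

we define the function $\sigma : \mathbb{N} \to 2^{\mathbb{N}}$ by

\[ \sigma(n) = \{ j \in \mathbb{N} : p_j \ge 1/n \}. \]

Since, as in (15),

\[ \sum_{p_j \ge 1/n} \sqrt{p_j} \le \sqrt{|\sigma(n)|}, \]

it suffices to show that

\[ |\sigma(n)| = o(n). \]

Suppose, to the contrary, that there exist a $c > 0$ and an increasing sequence $(n_k)_{k=1}^{\infty}$ such
that

\[ |\sigma(n_k)| \ge c n_k, \qquad k \ge 1. \]

Put $n_0 = 1$. Passing to a subsequence, we may assume that $n_k \ge 2n_{k-1}/c$ for every $k \ge 1$.
Now

\[
1 = \sum_{j=1}^{\infty} p_j
  \ge \sum_{k=1}^{\infty} \sum_{1/n_k \le p_j < 1/n_{k-1}} p_j
  \ge \sum_{k=1}^{\infty} \big( |\sigma(n_k)| - |\sigma(n_{k-1})| \big) \cdot \frac{1}{n_k}
  \ge \sum_{k=1}^{\infty} (c n_k - c n_k/2) \frac{1}{n_k}
  = \sum_{k=1}^{\infty} \frac{c}{2} = \infty,
\]

where the last inequality uses $|\sigma(n_{k-1})| \le n_{k-1} \le c n_k/2$.
The contradiction completes the proof. □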
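
To see the two ingredients of this argument at work, the Python sketch below evaluates $|\sigma(n)|/n$ and $\alpha_n + \beta_n$ for a Zipf-like distribution $p_j \propto j^{-2}$. The choice of distribution, its truncation to a finite support and the helper names are all assumptions made for illustration; the lemma itself needs no such restriction.

```python
import math


def zipf_like(m, s=2.0):
    """p_j proportional to j^(-s) on {1, ..., m}; an illustrative test distribution."""
    weights = [j ** -s for j in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sigma_size(p, n):
    """|sigma(n)|: the number of atoms with p_j >= 1/n."""
    return sum(1 for pj in p if pj >= 1 / n)


def alpha_plus_beta(p, n):
    """alpha_n(p) + beta_n(p) with break-point 1/n, as in (7)."""
    alpha = 2 * sum(pj for pj in p if pj < 1 / n)
    beta = sum(math.sqrt(pj) for pj in p if pj >= 1 / n) / math.sqrt(n)
    return alpha + beta
```

For this example $|\sigma(n)|$ grows like $\sqrt{n}$, so $|\sigma(n)|/n$ shrinks and $\alpha_n + \beta_n$ decays accordingly as $n$ increases.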
Lemma 11. For any rate sequence $1 > r_1 > r_2 > \ldots \searrow 0$, there is a distribution $p \in \mathbb{R}^{\mathbb{N}}$
such that

\[ \alpha_n(p) + \beta_n(p) > r_n, \qquad n \in \mathbb{N}. \]

Proof. It suffices to show that there is no rate sequence bounding $\alpha_n$. But this is obvious,
since $\alpha_n$ may be expressed as the tail of a series converging to 2, and although any such tail
must decay to zero, the rate may be arbitrarily slow. In particular, given some rate sequence
$(r_n)$, to ensure that $\sum_{p_j \ge 1/n} p_j < 1 - r_n$ for each $n \in \mathbb{N}$, we may choose the appropriate $p_j$ in
an iterative greedy fashion, for $n = 1, 2, \ldots$. □

Proof of Theorem 4. Item (i) is an immediate consequence of (4) and (6). Items (ii) and (iii)
are the contents of Lemmas 10 and 11, respectively. □

Proof of Lemma 6. The upper bound is contained in (8) and in fact holds for all $p$. To
establish the lower bound, let us rewrite the mean absolute deviation formula (11) as

\[ \mathbb{E}|Y - np| = 2k \binom{n}{k} p^k (1 - p)^{n-k+1}, \qquad k = \lfloor np \rfloor + 1. \]

Denote the right-hand side by $E(n, k, p)$, and put $G(n, k, p) = 2E(n, k, p)^2/(p(1 - p))$. The
left-hand inequality in the lemma is equivalent to the claim

\[ G(n, k, p) \ge n, \qquad p \in [1/n, 1 - 1/n],\ k = \lfloor np \rfloor + 1. \tag{17} \]

The domain where (17) is to be proved may be reparametrized by the inequalities

\[ 2 \le k \le n - 1, \qquad \frac{k-1}{n} \le p \le \frac{k}{n}. \]

Now the function $G(n, k, \cdot)$ is increasing on $[(k-1)/n, (2k-1)/(2n)]$ and decreasing on
$[(2k-1)/(2n), k/n]$, and hence we need only consider the endpoints $p = (k-1)/n$ and $p = k/n$.

To examine the first possibility, we take $p = (k-1)/n$ and seek a $k$ that minimizes
$G(n, k, (k-1)/n)$. To this end, we consider the inequality $G(n, k+1, k/n) \ge G(n, k, (k-1)/n)$,
which is equivalent (after a routine calculation) to

\[ \Big( \frac{k}{k-1} \Big)^{2k-1} \ge \Big( \frac{n-k+1}{n-k} \Big)^{2n-2k+1}. \tag{18} \]

Since the function $f(x) = (1 + 1/x)^{2x+1}$ is monotonically decreasing on $[1, \infty)$, the
inequality (18) holds whenever $k \le (n+1)/2$. We conclude that $G(n, k, (k-1)/n)$ is minimized
at the smallest allowed value of $k$, which is $k = 2$. We easily verify that the inequality
$G(n, 2, 1/n) \ge n$ is equivalent to $8(n-1)^{2n-1} \ge n^{2n-1}$ for all $n \ge 2$, which again follows from
the monotonicity of $(1 + 1/x)^{2x+1}$.

The second case, $p = k/n$, is analyzed in an exactly analogous manner. □
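
The two elementary facts used in this proof are easy to probe numerically. The Python sketch below evaluates $f(x) = (1 + 1/x)^{2x+1}$ and checks the base-case inequality $8(n-1)^{2n-1} \ge n^{2n-1}$ with exact integer arithmetic; the function names are illustrative, and a finite check does not replace the monotonicity argument.

```python
def f(x: float) -> float:
    """f(x) = (1 + 1/x)^(2x + 1), claimed to be decreasing on [1, infinity)."""
    return (1 + 1 / x) ** (2 * x + 1)


def base_case_holds(n: int) -> bool:
    """Exact integer check of 8 (n - 1)^(2n - 1) >= n^(2n - 1)."""
    return 8 * (n - 1) ** (2 * n - 1) >= n ** (2 * n - 1)
```

Note that $f(1) = 8$ and $f(n-1) = (n/(n-1))^{2n-1}$, so the base case is precisely the statement $f(n-1) \le f(1)$, with equality at $n = 2$.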

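Proof of Proposition 5. Let $n \ge 2$ and $Y_j \sim \mathrm{Bin}(n, p_j)$. We group the probabilities as follows:
$S_1 = \{j : p_j < 1/n\}$, $S_2 = \{j : 1/n \le p_j \le 1/2\}$ and $S_3 = \{j : p_j > 1/2\}$. By (12) and
Lemma 6,

\[ \mathbb{E}|Y_j - np_j| \ge \frac{1}{2} \begin{cases} np_j, & j \in S_1, \\ \sqrt{np_j}, & j \in S_2. \end{cases} \]

Now

\[ n\alpha_n(p) = \sum_{j \in S_1} 2np_j \le 4 \sum_{j \in S_1} \mathbb{E}|Y_j - np_j| \]

and, since $S_3$ contains at most one index,

\[ n\beta_n(p) = \sum_{j \in S_2 \cup S_3} \sqrt{np_j}
   \le \sum_{j \in S_2} \sqrt{np_j} + \sqrt{n}
   \le 2 \sum_{j \in S_2} \mathbb{E}|Y_j - np_j| + \sqrt{n}, \]

and thus

\[ 4 \sum_{j \in S_1} \mathbb{E}|Y_j - np_j| + 2 \sum_{j \in S_2} \mathbb{E}|Y_j - np_j| + \sqrt{n} \ge n\alpha_n + n\beta_n. \]

Since, by (9), the left-hand side is at most $4n\mathbb{E}J_n + \sqrt{n}$, this proves the claim. □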